TutorNet: Towards Flexible Knowledge Distillation for End-to-End Speech Recognition
Authors
Abstract
In recent years, there has been a great deal of research in developing end-to-end speech recognition models, which enable simplifying the traditional pipeline and achieving promising results. Despite their remarkable performance improvements, end-to-end models typically require expensive computational cost to show successful performance. To reduce this burden, knowledge distillation (KD), a popular model compression method, is used to transfer knowledge from a deep and complex model (teacher) to a shallower and simpler model (student). Previous KD approaches have commonly designed the architecture of the student by reducing the width per layer or the number of layers of the teacher model. This structural reduction scheme might limit the flexibility of model selection, since the student structure should be similar to that of the given teacher. To cope with this limitation, we propose a new KD method for end-to-end speech recognition, namely TutorNet, which can transfer knowledge across different types of neural networks at the hidden representation-level as well as the output-level. For concrete realizations, we firstly apply representation-level KD (RKD) during the initialization step, and then apply softmax-level KD (SKD) combined with the original task learning. When training with RKD, we make use of frame weighting that points out the frames to which the teacher model pays more attention. Through a number of experiments on the LibriSpeech dataset, it is verified that the proposed method not only distills the knowledge between different topologies but also significantly contributes to improving the word error rate (WER) of the distilled student. Interestingly, TutorNet allows the student model to surpass its teacher's performance in some particular cases.
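For illustration only, below is a minimal PyTorch-style sketch of the two distillation terms the abstract describes: a frame-weighted representation-level loss (RKD) used at initialization, and a softmax-level loss (SKD) mixed with the original task loss. It assumes a CTC objective as the task loss; the function names, shapes, projection layer, temperature, and mixing weight are illustrative assumptions, not the authors' released code.

import torch.nn.functional as F


def rkd_loss(student_hidden, teacher_hidden, frame_weights, proj):
    """Frame-weighted representation-level distillation (sketch).

    student_hidden: (T, N, Ds), teacher_hidden: (T, N, Dt),
    frame_weights:  (T, N) per-frame emphasis (assumed precomputed from the teacher),
    proj: linear layer mapping the student space (Ds) to the teacher space (Dt).
    """
    diff = proj(student_hidden) - teacher_hidden        # (T, N, Dt)
    per_frame = diff.pow(2).mean(dim=-1)                # (T, N)
    return (frame_weights * per_frame).mean()


def skd_task_loss(student_logits, teacher_logits, targets,
                  input_lengths, target_lengths,
                  temperature=1.0, lambda_skd=0.5):
    """Softmax-level distillation mixed with the original CTC task loss (sketch)."""
    # Original task loss on the student's frame-level log-probabilities.
    ctc = F.ctc_loss(F.log_softmax(student_logits, dim=-1),
                     targets, input_lengths, target_lengths)

    # KL divergence between teacher and student frame-level output distributions.
    t_probs = F.softmax(teacher_logits / temperature, dim=-1)
    s_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    skd = F.kl_div(s_log_probs, t_probs, reduction="batchmean")

    return (1.0 - lambda_skd) * ctc + lambda_skd * skd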
Similar Resources
Towards End-to-End Speech Recognition
Standard automatic speech recognition (ASR) systems follow a divide-and-conquer approach to convert speech into text. Alternately, the end goal is achieved by a combination of sub-tasks, namely, feature extraction, acoustic modeling and sequence decoding, which are optimized in an independent manner. More recently, in the machine learning community, deep learning approaches have emerged which al...
Towards End-To-End Speech Recognition with Recurrent Neural Networks
This paper presents a speech recognition system that directly transcribes audio data with text, without requiring an intermediate phonetic representation. The system is based on a combination of the deep bidirectional LSTM recurrent neural network architecture and the Connectionist Temporal Classification objective function. A modification to the objective function is introduced that trains the...
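As a rough illustration (assumed, not this paper's code), the sketch below shows the kind of model the snippet describes: a deep bidirectional LSTM over acoustic features whose frame-level outputs are trained with the Connectionist Temporal Classification (CTC) objective. Layer sizes and the label set are placeholders.

import torch.nn as nn
import torch.nn.functional as F


class BiLSTMCTC(nn.Module):
    def __init__(self, n_feats=80, hidden=320, layers=3, n_labels=29):
        super().__init__()
        # Deep bidirectional LSTM acoustic model.
        self.rnn = nn.LSTM(n_feats, hidden, num_layers=layers, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_labels)  # label set includes the CTC blank

    def forward(self, feats):                  # feats: (T, N, n_feats)
        hidden, _ = self.rnn(feats)
        return self.out(hidden)                # (T, N, n_labels) frame-level logits


# Training step (sketch): CTC aligns the frame-level outputs with the transcript.
# logits = BiLSTMCTC()(feats)
# loss = F.ctc_loss(logits.log_softmax(-1), targets, input_lengths, target_lengths)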
Towards Language-Universal End-to-End Speech Recognition
Building speech recognizers in multiple languages typically involves replicating a monolingual training recipe for each language, or utilizing a multi-task learning approach where models for different languages have separate output labels but share some internal parameters. In this work, we exploit recent progress in end-to-end speech recognition to create a single multilingual speech recogniti...
Tacotron: Towards End-to-End Speech Synthesis
A text-to-speech synthesis system typically consists of multiple stages, such as a text analysis frontend, an acoustic model and an audio synthesis module. Building these components often requires extensive domain expertise and may contain brittle design choices. In this paper, we present Tacotron, an end-to-end generative text-to-speech model that synthesizes speech directly from characters. G...
End-to-end Audiovisual Speech Recognition
Several end-to-end deep learning approaches have recently been presented which extract either audio or visual features from the input images or audio signals and perform speech recognition. However, research on end-to-end audiovisual models is very limited. In this work, we present an end-to-end audiovisual model based on residual networks and Bidirectional Gated Recurrent Units (BGRUs). To the ...
Journal
Journal title: IEEE/ACM Transactions on Audio, Speech, and Language Processing
Year: 2021
ISSN: 2329-9304, 2329-9290
DOI: https://doi.org/10.1109/taslp.2021.3071662